Machine Learning Analysis Pipeline
EDR: Dataset Loading & Preprocessing
EDR – Train/Test Overview
• Train shape: (88089, 20) | Test shape: (7533, 20)
• Total train samples: 88,089 | Total test samples: 7,533
• Number of features: 18
• Target column: 'label'
• Missing values (train): 0 | (test): 0
EDR – Train Class Distribution
• 0: 87,232
• 1: 857
• Class balance (minority/majority): 0.9824%
EDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
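The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code; the frame contents are toy values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny stand-in frames; the report's real loading code is not shown.
train = pd.DataFrame({"f1": [1.0, np.inf, 3.0], "f2": [0.1, 0.2, np.nan]})
test = pd.DataFrame({"f1": [2.0, -np.inf], "f2": [np.nan, 0.3]})

# 1. Replace +/-inf with NaN so they are imputed like other missing values.
train = train.replace([np.inf, -np.inf], np.nan)
test = test.replace([np.inf, -np.inf], np.nan)

# 2. Fill gaps with TRAIN medians only, to avoid leaking test statistics.
medians = train.median()
train = train.fillna(medians)
test = test.fillna(medians)

# 3. Fit the scaler on train, then apply the same transform to test.
scaler = StandardScaler()
X_train = scaler.fit_transform(train)
X_test = scaler.transform(test)
```

Fitting the imputer and scaler on the training split only is what keeps the test set an honest estimate of generalization.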
⚠️ Extreme Class Imbalance Detected
• Minority-to-majority ratio is only 0.9824% (857 positives vs. 87,232 negatives in train)
• Such extreme imbalance can push models to predict the majority class for every sample
• Consider more aggressive SMOTE ratios, cost-sensitive learning, or ensemble methods
• Precision-Recall AUC and F1 are more meaningful here than accuracy
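Cost-sensitive learning, one of the mitigations listed above, can be sketched with scikit-learn's `class_weight` option. The data below is synthetic and stands in for the real features; nothing here reproduces the pipeline's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic ~1%-positive data standing in for the real features.
X, y = make_classification(n_samples=5000, n_features=18, weights=[0.99],
                           random_state=0)

# Plain fit: the majority class dominates the loss, so recall on 1s suffers.
plain = LogisticRegression(max_iter=1000).fit(X, y)

# Cost-sensitive fit: class_weight="balanced" reweights each class by its
# inverse frequency, making every missed positive roughly 100x more costly.
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X, y)

recall_plain = recall_score(y, plain.predict(X))
recall_weighted = recall_score(y, weighted.predict(X))
```

SMOTE (as used by the Random Forest variant below) is the resampling counterpart, available via `imblearn.over_sampling.SMOTE` with a `sampling_strategy` ratio.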
Baseline (Most-Frequent) Accuracy: 0.9902
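The most-frequent baseline can be reproduced from the test split's class counts (7,459 negatives, 74 positives, per the confusion matrices below) with scikit-learn's `DummyClassifier`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels with the test split's class counts (7,459 negatives, 74 positives).
y_test = np.array([0] * 7459 + [1] * 74)
X_stub = np.zeros((len(y_test), 1))  # features are ignored by this strategy

# Most-frequent baseline: always predict the majority class.
dummy = DummyClassifier(strategy="most_frequent").fit(X_stub, y_test)
acc = dummy.score(X_stub, y_test)  # 7459 / 7533
```

Any model must beat this 0.9902 accuracy to be doing anything at all, which is why the imbalance-aware metrics above matter more than raw accuracy.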
EDR: Model Performance Comparison
EDR – Model Performance Metrics
| Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.9627 | 0.5597 | 0.0480 | 0.1486 | 0.0726 | 0.6822 | 0.0512 |
| Random Forest (SMOTE) | 0.9896 | 0.5466 | 0.3889 | 0.0946 | 0.1522 | 0.7053 | 0.1069 |
| LightGBM | 0.9891 | 0.5396 | 0.3000 | 0.0811 | 0.1277 | 0.8057 | 0.0800 |
| Balanced RF | 0.9046 | 0.6641 | 0.0438 | 0.4189 | 0.0794 | 0.8454 | 0.0657 |
| SGD SVM | 0.9707 | 0.5637 | 0.0651 | 0.1486 | 0.0905 | n/a | n/a |
| IsolationForest | 0.9773 | 0.5470 | 0.0708 | 0.1081 | 0.0856 | n/a | n/a |
ROC-AUC and PR-AUC are n/a where the pipeline recorded no probability scores for the model.
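The table's columns map onto standard scikit-learn metrics. A minimal sketch with toy labels and scores (illustrative only):

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, f1_score,
                             average_precision_score, roc_auc_score)

# Toy scores/labels; the report computes the same metrics on real test data.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.6, 0.5, 0.9])
y_pred = (y_score >= 0.5).astype(int)

metrics = {
    "balanced_acc": balanced_accuracy_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_score),
    # average_precision_score is the PR-AUC column in the table above.
    "pr_auc": average_precision_score(y_true, y_score),
}
```

Note that ROC-AUC and PR-AUC need continuous scores, not hard 0/1 predictions, which is why they are unavailable for models without probability outputs.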
Confusion Matrix Analysis
| Model | TN | FP | FN | TP | FP Rate | Miss Rate |
|---|---|---|---|---|---|---|
| Logistic Regression | 7241 | 218 | 63 | 11 | 2.92% | 85.14% |
| Random Forest (SMOTE) | 7448 | 11 | 67 | 7 | 0.15% | 90.54% |
| LightGBM | 7445 | 14 | 68 | 6 | 0.19% | 91.89% |
| Balanced RF | 6783 | 676 | 43 | 31 | 9.06% | 58.11% |
| SGD SVM | 7301 | 158 | 63 | 11 | 2.12% | 85.14% |
| IsolationForest | 7354 | 105 | 66 | 8 | 1.41% | 89.19% |
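The derived rate columns follow directly from the counts; for example, reproducing the Balanced RF row:

```python
# Counts from the Balanced RF row of the table above.
tn, fp, fn, tp = 6783, 676, 43, 31

fp_rate = fp / (fp + tn)    # fraction of benign (label-0) samples flagged
miss_rate = fn / (fn + tp)  # fraction of true attacks (label-1) missed
```

The same two formulas apply to every row, so FP rate and miss rate are fully determined by the confusion matrix.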
Best Models by Metric
| Metric | Best Model | Value |
|---|---|---|
| Accuracy | Random Forest (SMOTE) | 0.9896 |
| Balanced Acc | Balanced RF | 0.6641 |
| Precision | Random Forest (SMOTE) | 0.3889 |
| Recall | Balanced RF | 0.4189 |
| F1 | Random Forest (SMOTE) | 0.1522 |
| ROC-AUC | Balanced RF | 0.8454 |
| PR-AUC | Random Forest (SMOTE) | 0.1069 |
| Lowest False Positive Rate | Random Forest (SMOTE) | 0.15% |
| Lowest Miss Rate | Balanced RF | 58.11% |
EDR – Metrics by Model
EDR – ROC Curves
EDR – Precision–Recall Curves
EDR – Predicted Probability Distributions
EDR – Threshold Sweep
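A threshold sweep like the one plotted above can be sketched as follows. The scores are synthetic stand-ins for one model's predicted probabilities; the point is that with ~1% positives the F1-optimal cutoff is rarely the default 0.5.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Synthetic scores standing in for a model's predicted probabilities.
y_true = (rng.random(2000) < 0.05).astype(int)
y_score = np.clip(rng.normal(0.2, 0.1, 2000) + 0.35 * y_true, 0, 1)

# Evaluate F1 at each candidate threshold and keep the best one.
thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_true, (y_score >= t).astype(int), zero_division=0)
       for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]
```

In practice the sweep should be run on a validation split, not the test set, so the chosen threshold does not overfit the reported numbers.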
EDR: Logistic Regression – Detailed Analysis
EDR – Logistic Regression: Confusion Matrix
EDR – Logistic Regression: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9914 | 0.9708 | 0.9810 | 7,459 |
| 1 | 0.0480 | 0.1486 | 0.0726 | 74 |
| accuracy | | | 0.9627 | 7,533 |
EDR – Logistic Regression: Feature Importance
EDR: Random Forest (SMOTE) – Detailed Analysis
EDR – Random Forest (SMOTE): Confusion Matrix
EDR – Random Forest (SMOTE): Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9911 | 0.9985 | 0.9948 | 7,459 |
| 1 | 0.3889 | 0.0946 | 0.1522 | 74 |
| accuracy | | | 0.9896 | 7,533 |
EDR – Random Forest (SMOTE): Feature Importance
EDR: LightGBM – Detailed Analysis
EDR – LightGBM: Confusion Matrix
EDR – LightGBM: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9909 | 0.9981 | 0.9945 | 7,459 |
| 1 | 0.3000 | 0.0811 | 0.1277 | 74 |
| accuracy | | | 0.9891 | 7,533 |
EDR – LightGBM: Feature Importance
EDR: Balanced RF – Detailed Analysis
EDR – Balanced RF: Confusion Matrix
EDR – Balanced RF: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9937 | 0.9094 | 0.9497 | 7,459 |
| 1 | 0.0438 | 0.4189 | 0.0794 | 74 |
| accuracy | | | 0.9046 | 7,533 |
EDR – Balanced RF: Feature Importance
EDR: SGD SVM – Detailed Analysis
EDR – SGD SVM: Confusion Matrix
EDR – SGD SVM: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9914 | 0.9788 | 0.9851 | 7,459 |
| 1 | 0.0651 | 0.1486 | 0.0905 | 74 |
| accuracy | | | 0.9707 | 7,533 |
EDR – SGD SVM: Feature Importance
EDR: IsolationForest – Detailed Analysis
EDR – IsolationForest: Confusion Matrix
EDR – IsolationForest: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9911 | 0.9859 | 0.9885 | 7,459 |
| 1 | 0.0708 | 0.1081 | 0.0856 | 74 |
| accuracy | | | 0.9773 | 7,533 |
EDR – IsolationForest: Feature Importance
Feature importance not available for this model type.
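IsolationForest exposes no `feature_importances_` attribute, but permutation importance against the anomaly score is a common workaround. A minimal sketch on synthetic two-feature data (all values illustrative, not the pipeline's data):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Toy data: feature 0 carries the anomaly signal, feature 1 is noise.
X = rng.normal(size=(500, 2))
y = np.zeros(500, dtype=int)
y[:20] = 1
X[:20, 0] += 5.0  # make the 20 anomalies extreme on feature 0

iso = IsolationForest(random_state=0).fit(X)
# score_samples is higher for normal points, so negate it as an anomaly score.
base = roc_auc_score(y, -iso.score_samples(X))

# Permutation importance: shuffle one feature, measure the AUC drop.
drops = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    drops.append(base - roc_auc_score(y, -iso.score_samples(Xp)))
```

A large drop after shuffling a feature means the detector relied on it; the noise feature's drop stays near zero.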
XDR: Dataset Loading & Preprocessing
XDR – Train/Test Overview
• Train shape: (88089, 34) | Test shape: (7533, 34)
• Total train samples: 88,089 | Total test samples: 7,533
• Number of features: 32
• Target column: 'label'
• Missing values (train): 0 | (test): 0
XDR – Train Class Distribution
• 0: 87,232
• 1: 857
• Class balance (minority/majority): 0.9824%
XDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
⚠️ Extreme Class Imbalance Detected
• Minority-to-majority ratio is only 0.9824% (857 positives vs. 87,232 negatives in train)
• Such extreme imbalance can push models to predict the majority class for every sample
• Consider more aggressive SMOTE ratios, cost-sensitive learning, or ensemble methods
• Precision-Recall AUC and F1 are more meaningful here than accuracy
Baseline (Most-Frequent) Accuracy: 0.9902
XDR: Model Performance Comparison
XDR – Model Performance Metrics
| Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.8250 | 0.6374 | 0.0252 | 0.4459 | 0.0477 | 0.6560 | 0.0459 |
| Random Forest (SMOTE) | 0.9898 | 0.5399 | 0.4000 | 0.0811 | 0.1348 | 0.6897 | 0.1147 |
| LightGBM | 0.9899 | 0.5132 | 0.3333 | 0.0270 | 0.0500 | 0.8440 | 0.0773 |
| Balanced RF | 0.9190 | 0.6781 | 0.0533 | 0.4324 | 0.0950 | 0.8459 | 0.0597 |
| SGD SVM | 0.8852 | 0.5942 | 0.0263 | 0.2973 | 0.0484 | n/a | n/a |
| IsolationForest | 0.9870 | 0.5185 | 0.1000 | 0.0405 | 0.0577 | n/a | n/a |
ROC-AUC and PR-AUC are n/a where the pipeline recorded no probability scores for the model.
Confusion Matrix Analysis
| Model | TN | FP | FN | TP | FP Rate | Miss Rate |
|---|---|---|---|---|---|---|
| Logistic Regression | 6182 | 1277 | 41 | 33 | 17.12% | 55.41% |
| Random Forest (SMOTE) | 7450 | 9 | 68 | 6 | 0.12% | 91.89% |
| LightGBM | 7455 | 4 | 72 | 2 | 0.05% | 97.30% |
| Balanced RF | 6891 | 568 | 42 | 32 | 7.61% | 56.76% |
| SGD SVM | 6646 | 813 | 52 | 22 | 10.90% | 70.27% |
| IsolationForest | 7432 | 27 | 71 | 3 | 0.36% | 95.95% |
Best Models by Metric
| Metric | Best Model | Value |
|---|---|---|
| Accuracy | LightGBM | 0.9899 |
| Balanced Acc | Balanced RF | 0.6781 |
| Precision | Random Forest (SMOTE) | 0.4000 |
| Recall | Logistic Regression | 0.4459 |
| F1 | Random Forest (SMOTE) | 0.1348 |
| ROC-AUC | Balanced RF | 0.8459 |
| PR-AUC | Random Forest (SMOTE) | 0.1147 |
| Lowest False Positive Rate | LightGBM | 0.05% |
| Lowest Miss Rate | Logistic Regression | 55.41% |
XDR – Metrics by Model
XDR – ROC Curves
XDR – Precision–Recall Curves
XDR – Predicted Probability Distributions
XDR – Threshold Sweep
XDR: Logistic Regression – Detailed Analysis
XDR – Logistic Regression: Confusion Matrix
XDR – Logistic Regression: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9934 | 0.8288 | 0.9037 | 7,459 |
| 1 | 0.0252 | 0.4459 | 0.0477 | 74 |
| accuracy | | | 0.8250 | 7,533 |
XDR – Logistic Regression: Feature Importance
XDR: Random Forest (SMOTE) – Detailed Analysis
XDR – Random Forest (SMOTE): Confusion Matrix
XDR – Random Forest (SMOTE): Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9910 | 0.9988 | 0.9949 | 7,459 |
| 1 | 0.4000 | 0.0811 | 0.1348 | 74 |
| accuracy | | | 0.9898 | 7,533 |
XDR – Random Forest (SMOTE): Feature Importance
XDR: LightGBM – Detailed Analysis
XDR – LightGBM: Confusion Matrix
XDR – LightGBM: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9904 | 0.9995 | 0.9949 | 7,459 |
| 1 | 0.3333 | 0.0270 | 0.0500 | 74 |
| accuracy | | | 0.9899 | 7,533 |
XDR – LightGBM: Feature Importance
XDR: Balanced RF – Detailed Analysis
XDR – Balanced RF: Confusion Matrix
XDR – Balanced RF: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9939 | 0.9239 | 0.9576 | 7,459 |
| 1 | 0.0533 | 0.4324 | 0.0950 | 74 |
| accuracy | | | 0.9190 | 7,533 |
XDR – Balanced RF: Feature Importance
XDR: SGD SVM – Detailed Analysis
XDR – SGD SVM: Confusion Matrix
XDR – SGD SVM: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9922 | 0.8910 | 0.9389 | 7,459 |
| 1 | 0.0263 | 0.2973 | 0.0484 | 74 |
| accuracy | | | 0.8852 | 7,533 |
XDR – SGD SVM: Feature Importance
XDR: IsolationForest – Detailed Analysis
XDR – IsolationForest: Confusion Matrix
XDR – IsolationForest: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9905 | 0.9964 | 0.9935 | 7,459 |
| 1 | 0.1000 | 0.0405 | 0.0577 | 74 |
| accuracy | | | 0.9870 | 7,533 |
XDR – IsolationForest: Feature Importance
Feature importance not available for this model type.